Maximum likelihood estimation

Definition

Maximum likelihood estimation (MLE) is a popular method for estimating the parameters of a statistical model.
In other words, it first computes the probability ("how likely") of each data point under a given distribution/model, and then seeks the distribution/model parameters that maximize the joint probability of all data points.

So what exactly is likelihood?
A likelihood function is numerically equal to a conditional probability, but it is treated as a function of the variable after the "|" sign (here, the parameters), with the data held fixed.
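To make this concrete, here is a minimal sketch (assuming SciPy is available) that evaluates one fixed data point under two candidate parameter settings; the data stay the same, and only the parameters, the likelihood's arguments, change:

```python
import scipy.stats as st

x = 1.5  # one observed data point (arbitrary example value)

# Same PDF, two candidate parameter settings: the likelihood is a
# function of the parameters (mu, sigma) with the data x held fixed.
lik_a = st.norm.pdf(x, loc=0.0, scale=1.0)  # L(mu=0, sigma=1 | x)
lik_b = st.norm.pdf(x, loc=1.0, scale=1.0)  # L(mu=1, sigma=1 | x)

print(lik_a, lik_b)  # mu=1 is "more likely" to have produced x=1.5
```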

Implementation

I put MLE applications into two categories: fitting distributions and fitting regressions.

Examples

Distributions

Normal distribution

  1. formulate the PDF of the distribution P(x_i|θ)
P(x_i|\theta) = \mathcal{N}(x_i;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
  2. compute the product of the likelihoods of all data points L(θ) (see the sketch below)
L(\theta) = L(\mu,\sigma) = P(X|\mu,\sigma) = \prod_{i}^{n} \mathcal{N}(x_i;\mu,\sigma)
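As a rough sketch of these two steps in code (in practice the log-likelihood is maximized, since a product of many small densities underflows; variable names are mine, NumPy/SciPy assumed):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data

def neg_log_likelihood(params):
    mu, log_sigma = params       # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # sum of log N(x_i; mu, sigma), negated for a minimizer
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# For the normal, the MLE also has a closed form:
# mu_hat ≈ x.mean() and sigma_hat ≈ x.std()
```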

Binomial distribution

  1. formulate the PMF of the distribution P(x_i|θ)
p_i = b(i; n, p) = \binom{n}{i} p^i (1-p)^{n-i}
  2. compute the product of the likelihoods of all data points L(θ) (see the sketch below)
L(\theta) = L(p) = P(X|p) = \prod_{i}^{n} \binom{n}{i} p^i (1-p)^{n-i}
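A minimal sketch, assuming each observation is a count of successes out of n trials; for the binomial, maximizing L(p) has the closed-form solution "total successes / total trials":

```python
import numpy as np

n = 10                              # trials per observation (assumed known)
rng = np.random.default_rng(1)
x = rng.binomial(n, 0.3, size=500)  # observed success counts, true p = 0.3

# Maximizing L(p) = prod_i C(n, x_i) p^{x_i} (1-p)^{n-x_i}
# yields the closed-form MLE: total successes / total trials.
p_hat = x.sum() / (n * len(x))
print(p_hat)  # ~0.3
```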

Regressions

Linear regression

  1. formulate the conditional probability P(y_i|x_i,θ)
    for a linear regression, we have:
\hat{y} = f(x) = \theta x + \eta

e.g., if the noise is i.i.d. normally distributed, \eta \sim \mathcal{N}(0,\sigma^2),
then y is also normally distributed: y \sim \mathcal{N}(\theta x,\sigma^2)

P(y_i|x_i,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i-\theta x_i)^2}
  2. compute the product of the likelihoods of all y values L(θ) (see the sketch below)
L(\theta) = P(Y|X,\theta) = \prod_{i}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i-\theta x_i)^2}
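A hedged sketch (the names theta_hat and sigma_hat are mine) showing that maximizing this Gaussian likelihood amounts to minimizing the sum of squared residuals:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=200)
y = 1.7 * x + rng.normal(0, 0.5, size=200)  # true theta = 1.7, sigma = 0.5

def neg_log_likelihood(params):
    theta, log_sigma = params
    sigma = np.exp(log_sigma)
    resid = y - theta * x
    # -sum log N(y_i; theta * x_i, sigma^2): up to constants, this is the
    # residual sum of squares scaled by sigma, hence ordinary least squares.
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + resid**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
theta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```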

Logistic regression (here using categorical data)

  1. formulate the conditional probability P(y_i|x_i,θ)
    for a logistic function representing binary categorical data ("A" or "B"), we have the probability of "A":
\hat{y} = f(x) = \frac{1}{1+e^{-b(x+c)}}
P(y_i = \text{category A}|x_i,\theta) = \hat{y}_i
P(y_i = \text{category B}|x_i,\theta) = 1-\hat{y}_i
  2. compute the product of the likelihoods of all y values L(θ), encoding y_i = 1 for category A and y_i = 0 for category B (see the sketch below)
L(\theta) = P(Y|X,\theta) = \prod_{i}^{n} \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}
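A minimal sketch of fitting b and c by numerically maximizing this likelihood, encoding category A as y = 1 and category B as y = 0 (SciPy assumed, variable names mine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, size=300)
p_true = 1.0 / (1.0 + np.exp(-2.0 * (x + 0.5)))  # true b = 2, c = 0.5
y = rng.binomial(1, p_true)                       # 1 = category A, 0 = category B

def neg_log_likelihood(params):
    b, c = params
    y_hat = 1.0 / (1.0 + np.exp(-b * (x + c)))    # predicted P(category A)
    eps = 1e-12                                   # guard against log(0)
    # -sum of y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i)
    return -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

res = minimize(neg_log_likelihood, x0=[1.0, 0.0])
b_hat, c_hat = res.x
```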